Rethinking Benchmark and Contamination for Language Models with Rephrased Samples
Large language models are increasingly trained on all the data ever produced
by humans. Many have raised concerns about the trustworthiness of public
benchmarks due to potential contamination in pre-training or fine-tuning
datasets. While most data decontamination efforts apply string matching (e.g.,
n-gram overlap) to remove benchmark data, we show that these methods are
insufficient, and simple variations of test data (e.g., paraphrasing,
translation) can easily bypass these decontamination measures. Furthermore, we
demonstrate that if such variations of test data are not eliminated, a 13B model
can easily overfit a test benchmark and achieve drastically inflated performance,
on par with GPT-4. We validate such observations in widely used benchmarks such
as MMLU, GSM8K, and HumanEval. To address this growing risk, we propose a
stronger LLM-based decontamination method and apply it to widely used
pre-training and fine-tuning datasets, revealing significant previously unknown
test overlap. For example, in pre-training sets such as RedPajama-Data-1T and
StarCoder-Data, we identified that 8-18% of the HumanEval benchmark overlaps.
Interestingly, we also find such contamination in synthetic datasets generated
by GPT-3.5/4, suggesting a potential risk of unintentional contamination. We
urge the community to adopt stronger decontamination approaches when using
public benchmarks. Moreover, we call for the community to actively develop
fresh one-time exams to evaluate models accurately. Our decontamination tool is
publicly available at https://github.com/lm-sys/llm-decontaminator
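To illustrate why string matching fails, here is a minimal sketch of the n-gram-overlap decontamination baseline the abstract critiques. All names and the n-gram length are illustrative, not taken from the paper's tool.

```python
# Illustrative n-gram-overlap decontamination check (the string-matching
# baseline the abstract argues is insufficient). Names are hypothetical.

def ngrams(text: str, n: int) -> set:
    """Return the set of word-level n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_doc: str, test_sample: str, n: int = 5) -> bool:
    """Flag a training document if it shares any n-gram with a test sample."""
    return bool(ngrams(train_doc, n) & ngrams(test_sample, n))

sample = "What is the capital of France? The capital of France is Paris."
exact_copy = sample
paraphrase = "Which city serves as France's capital? Paris is the French capital."

print(is_contaminated(exact_copy, sample))   # True: an exact copy is caught
print(is_contaminated(paraphrase, sample))   # False: the paraphrase slips through
```

The second call shows the failure mode the abstract describes: a paraphrased or translated test sample shares no long n-grams with the original, so string matching never flags it.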
On Optimal Caching and Model Multiplexing for Large Model Inference
Large Language Models (LLMs) and other large foundation models have achieved
noteworthy success, but their size exacerbates existing resource consumption
and latency challenges. In particular, the large-scale deployment of these
models is hindered by the significant resource requirements during inference.
In this paper, we study two approaches for mitigating these challenges:
employing a cache to store previous queries and learning a model multiplexer to
choose from an ensemble of models for query processing.
Theoretically, we provide an optimal algorithm for jointly optimizing both
approaches to reduce the inference cost in both offline and online tabular
settings. By combining a caching algorithm, namely Greedy Dual Size with
Frequency (GDSF) or Least Expected Cost (LEC), with a model multiplexer, we
achieve optimal rates in both offline and online settings. Empirically,
simulations show that the combination of our caching and model multiplexing
algorithms greatly improves over the baselines, with up to
improvement over the baseline when the ratio between the maximum cost and
minimum cost is . Experiments on real datasets show a
improvement in FLOPs over the baseline when the ratio for FLOPs is , and a
improvement in latency when the ratio for average latency is
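One of the caching policies the abstract pairs with a model multiplexer is Greedy Dual Size with Frequency (GDSF). The sketch below is a minimal, assumption-laden illustration of its priority rule (clock + frequency x cost / size); the class name, cost values, and fixed-capacity eviction are made up for the example, not taken from the paper.

```python
# Minimal sketch of GDSF eviction for a query cache. The priority of an entry
# is clock + frequency * cost / size, so cheap, rarely reused queries are
# evicted first. All names and values here are illustrative.

class GDSFCache:
    def __init__(self, capacity: int):
        self.capacity = capacity   # max number of cached queries
        self.clock = 0.0           # aging term, raised on each eviction
        self.entries = {}          # key -> (priority, freq, cost, size)

    def get(self, key):
        """Return the key on a hit (bumping its priority), else None."""
        if key not in self.entries:
            return None
        _, freq, cost, size = self.entries[key]
        freq += 1
        self.entries[key] = (self.clock + freq * cost / size, freq, cost, size)
        return key

    def put(self, key, cost: float, size: float = 1.0):
        """Insert a query result, evicting the lowest-priority entry if full."""
        if len(self.entries) >= self.capacity:
            victim = min(self.entries, key=lambda k: self.entries[k][0])
            self.clock = self.entries[victim][0]  # age the cache to the evicted priority
            del self.entries[victim]
        self.entries[key] = (self.clock + cost / size, 1, cost, size)

cache = GDSFCache(capacity=2)
cache.put("expensive query", cost=100.0)
cache.put("cheap query", cost=1.0)
cache.put("another cheap query", cost=1.0)  # evicts "cheap query", not the expensive one
```

The design point is that priority scales with the cost of recomputing a response, which is what makes cost-aware policies like GDSF and LEC a natural fit for LLM inference, where per-query costs vary widely.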
Overestimation of thermal emittance in solenoid scans due to coupled transverse motion
The solenoid scan is a widely used method for the in-situ measurement of the
thermal emittance in a photocathode gun. The popularity of this method is due
to its simplicity and convenience since all rf photocathode guns are equipped
with an emittance compensation solenoid. This paper shows that the solenoid
scan measurement overestimates the thermal emittance in the ordinary
measurement configuration due to a weak quadrupole field (present in either the
rf gun or gun solenoid) followed by a rotation in the solenoid. This coupled
transverse dynamics aberration introduces a correlation between the beam's
horizontal and vertical motion, leading to an increase in the measured 2D
transverse emittance and thus an overestimation of the thermal emittance. This
effect was systematically studied using both analytic expressions and numerical
simulations. These studies were experimentally verified using an L-band
1.6-cell rf photocathode gun with a cesium telluride cathode, which shows a
thermal emittance overestimation of 35% with an rms laser spot size of 2.7 mm.
The paper concludes by showing that the accuracy of the solenoid scan can be
improved by using a quadrupole magnet corrector, consisting of a pair of normal
and skew quadrupole magnets. (Comment: 12 pages, 13 figures)
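The mechanism can be stated with the textbook rms projected emittance (a standard expression, not a formula taken from this paper):

```latex
% Projected rms emittance in the horizontal plane:
\varepsilon_x = \sqrt{\langle x^2\rangle \langle x'^2\rangle - \langle x x'\rangle^2}
% A weak quadrupole field followed by the solenoid's rotation couples the
% planes, so the beam moments acquire cross-plane correlations
% (e.g. \langle x y'\rangle \neq 0). The projected 2D emittance then exceeds
% the intrinsic value:
\varepsilon_{x,\mathrm{meas}} \ge \varepsilon_{\mathrm{thermal}},
% with equality only when the x--y correlations vanish.
```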
Factorized Q-Learning for Large-Scale Multi-Agent Systems
Deep Q-learning has achieved significant success in single-agent decision
making tasks. However, it is challenging to extend Q-learning to large-scale
multi-agent scenarios, due to the explosion of action space resulting from the
complex dynamics between the environment and the agents. In this paper, we
propose to make the computation of multi-agent Q-learning tractable by treating
the Q-function (w.r.t. state and joint-action) as a high-order high-dimensional
tensor and then approximating it with factorized pairwise interactions.
Furthermore, we utilize a composite deep neural network architecture for
computing the factorized Q-function, share the model parameters among all the
agents within the same group, and estimate the agents' optimal joint actions
through a coordinate descent type algorithm. All these simplifications greatly
reduce the model complexity and accelerate the learning process. Extensive
experiments on two different multi-agent problems demonstrate the performance
gain of our proposed approach in comparison with strong baselines, particularly
when there are a large number of agents. (Comment: 7 pages, 5 figures, DAI 201)
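The two ideas in the abstract, a pairwise-factorized joint Q-function and coordinate-descent action selection, can be sketched as follows. Real implementations use shared deep networks; here random lookup tables stand in for them, and all names are illustrative.

```python
import random

# Hypothetical sketch: approximate the joint Q-function as a sum of pairwise
# interaction terms, then pick joint actions by coordinate descent (improving
# one agent's action at a time). Tables replace the paper's neural networks.

random.seed(0)
n_agents, n_actions = 4, 3

# One pairwise table q[a_i][a_j] per agent pair (state held fixed for brevity).
pair_q = {(i, j): [[random.gauss(0, 1) for _ in range(n_actions)]
                   for _ in range(n_actions)]
          for i in range(n_agents) for j in range(i + 1, n_agents)}

def joint_q(actions):
    """Factorized joint Q-value: sum over all pairwise interaction terms."""
    return sum(q[actions[i]][actions[j]] for (i, j), q in pair_q.items())

def coordinate_descent(steps: int = 10):
    """Greedily re-optimize each agent's action while holding the rest fixed."""
    actions = [0] * n_agents
    for _ in range(steps):
        for i in range(n_agents):
            actions[i] = max(
                range(n_actions),
                key=lambda a: joint_q(actions[:i] + [a] + actions[i + 1:]))
    return actions

best = coordinate_descent()
```

Because each coordinate update keeps the current action as a candidate, the joint Q-value never decreases, which is what makes the per-agent updates a tractable substitute for searching the exponentially large joint-action space.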